Following the RNAi identity QC, we should already corrected all wrong clones and ready to move to the second QC step, sample qualities. Sample quality QC identifies bad samples that behave unexpected using Exploratory Data Analysis (EDA). For instance, a sample that decorrelates with the other two biological replicates for the same RNAi condition. This EDA based QC is a standard step of RNA-seq analysis and we follow the most common way - using PCA, correlation and distances. This walkthrough provides a convenient template to generate these QC plots for identifying bad samples.
Both standard and custom application of WPS should go through this sample quality check, and remove all bad samples before proceeding to the biological interpretation phase (such as a DE analysis).
# install if not in your environment
library(stringr)
library(ggplot2)
library("GGally")
library(DESeq2)
library(ggrepel)
library(pheatmap)
library(png)
The following code uses a wrapper function to generate EDA plots for each condition. Please note that we offer this function just for the convenience to seamlessly analyze WPS data. Please feel free to perform the EDA/sample quality check in your own way!
source('RNAi_QC_functions.R')
# we use met7 as an example
exps = c('met7')
libs = c('lib1','lib2','lib3','lib4','lib5','lib6')
# specify the ID (name in the treatment covariate) for control samples
controlID = 'x.vector' # in WPS, we conveniently call them 'vector'. If you use another name such as "control", change it to the actual name.
# loop through each plate and then each library
for (expID in exps){
for (libID in libs){
EDA_plots(expID,libID, controlID)
}
}
## converting counts to integer mode
## converting counts to integer mode
## converting counts to integer mode
## converting counts to integer mode
## converting counts to integer mode
## converting counts to integer mode
A natural thought for bad quality samples are those appearing to be obviously different than its replicates and such difference is huge and cannot be explained by biological variations. In theory, we can train a classifier or use some cutoffs to programatically identify samples with such behavior. However, we also realize that samples can go awry in various ways and it is always good to take a look at the data before going into any biological interpretations. Therefore, we opted to ask user to interpret these QC plots manually as we did to be certain on how the experiment went. Let’s look at an example EDA result using met7-lib4.
The figures are saved in a single output pdf: met7_lib4_initial_EDA.pdf
The EDA results include three plots for each condition versus control:
An example of bad samples is x.him_14_rep2 in met7-lib4. We noticed that it appeared as a strong outlier in PCA and distance heatmap plot, and its correlation with the other two replicates are not as good as typical ones, especially with a decorrelation at highly expressed region. These together suggest excluding x.him_14_rep2 from the downstream analysis as it may have technical problem such as mRNA degradation in the experiment.
Notably, the frequency of bad samples is extremely low with our final WPS protocol. So, one should not expect removing many samples in the QC step. Instead, it is more for gaining confidence in the data and spot any unexpected phenomenon.
When inspecting the plots, one should interactively identify and label out the bad samples. These samples will be excluded in our next step of saving final cleaned dataset. For convenience, we also provide a wrapper function to seamlessly label out these samples.
# manually set the bad sample set for each library. Still run this code with empty bad set even if there is no bad samples. This is to be seamless and organized.
expID = 'met7'
libID = 'lib4'
badSet = c('x.him_14_rep2') # put all bad samples in the corresponding library here
# if there is no bad sample, put an empty badSet by badSet = c()
qLevels(expID, libID, badSet)
Now, we are good to do the final step of data cleaning/QC - save the clean data.